A Survey of Recoverable Distributed Shared Memory Systems
Authors
Abstract
Distributed shared memory (DSM) systems provide a shared memory abstraction on distributed memory architectures (distributed memory multicomputers, networks of workstations). Such systems ease parallel application programming because the shared memory programming model is often more natural than the message-passing paradigm. However, the probability of failure of a DSM system increases with the number of sites. Fault tolerance mechanisms must therefore be implemented so that processes can continue their execution in the event of a failure. This paper gives an overview of recoverable DSM (RDSM) systems, which provide a checkpointing mechanism to restart parallel computations after a site failure.
Related resources
A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability
Large-scale distributed systems are very attractive for executing parallel applications that require huge computing power. However, their high probability of site failure is unacceptable, especially for long-running applications. In this paper, we address this problem and propose a checkpointing mechanism relying on a recoverable distributed shared memory (DSM). Although most recover...
Research on Adaptive and Recoverable Distributed Shared Memory
Software distributed shared memory (DSM) systems have many advantages over message-passing systems. Because DSM provides the user with a simple shared memory abstraction, the user does not have to be concerned with data movement between hosts. Many applications programmed for a shared memory multiprocessor system can be executed on a software DSM system without significant modifications. This paper...
An Extended Coherence Protocol for Recoverable DSM Systems with Causal Consistency
This paper presents a coherence protocol for recoverable Distributed Shared Memory (DSM) systems with causally consistent read-write objects. It uses independent checkpointing tightly integrated with coherence operations. That integration results in high availability of shared objects and ensures fast restoration of the consistent state of DSM in spite of multiple node failures, introducing lit...
Architectural Issues in Adopting Distributed Shared Memory for Distributed Object Management Systems
Distributed shared memory (DSM) provides a transparent network interface based on the memory abstraction. Furthermore, DSM offers ease of programming and portability. The advantages offered by DSM also include low network overhead, with no explicit operating system intervention to move data over the network. With the advent of high-bandwidth networks and wide addressing, adopting DSM for distrib...
Replication for Efficiency and Fault Tolerance in a DSM System
Distributed Shared Memory (DSM) systems implemented on a network of workstations (NOW) have become a convenient alternative to shared memory architectures for executing long-running parallel applications. However, such architectures are susceptible to failures. This paper presents the design and implementation of a recoverable DSM (RDSM) based on a backward error recovery (BER) mechani...